Journals
  Publication Years
  Keywords
Search within results Open Search
Please wait a minute...
For Selected: Toggle Thumbnails
Imbalanced data classification algorithm based on ball cluster partitioning and undersampling with density peak optimization
Xuewen LIU, Jikui WANG, Zhengguo YANG, Qiang LI, Jihai YI, Bing LI, Feiping NIE
Journal of Computer Applications    2022, 42 (5): 1455-1463.   DOI: 10.11772/j.issn.1001-9081.2021050736
Abstract282)   HTML5)    PDF (1551KB)(73)       Save

It is an effective hybrid strategy for imbalanced data classification of integrating cost-sensitivity and resampling methods into the ensemble algorithms. Concerning the problem that the misclassification cost calculation and undersampling process less consider the intra-class and inter-class distributions of samples in the existing hybrid methods, an imbalanced data classification algorithm based on ball cluster partitioning and undersampling with density peak optimization was proposed, named Boosting algorithm based on Ball Cluster Partitioning and UnderSampling with Density Peak optimization (DPBCPUSBoost). Firstly, the density peak information was used to define the sampling weights of majority samples, and the majority ball cluster with “neighbor cluster” was divided into “area misclassified easily” and “area misclassified hardly”, then the sampling weight of samples in “area misclassified easily” was increased. Secondly, the majority samples were undersampled based on the sampling weights in the first iteration, then the majority samples were undersampled based on the sample distribution weight in every iteration. And the weak classifier was trained on the temporary training set combining the undersampled majority samples with all minority samples. Finally, the density peak information of samples was combined with the categorical distribution of samples to define the different misclassification costs for all samples, and the weights of samples with higher misclassification cost were increased by the cost adjustment function. Experimental results on 10 KEEL datasets indicate that, the number of datasets with the highest performance achieved by DPBCPUSBoost is more than that of the imbalanced data classification algorithms such as Adaptive Boosting (AdaBoost), Cost-sensitive AdaBoost (AdaCost), Random UnderSampling Boosting (RUSBoost) and UnderSampling and Cost-sensitive Boosting (USCBoost), in terms of evaluation metrics such as Accuracy, F1-Score, Geometric Mean (G-mean) and Area Under Curve (AUC) of Receiver Operating Characteristic (ROC). Experimental results verify that the definition of sample misclassification cost and sampling weight of the proposed DPBCPUSBoost is effective.

Table and Figures | Reference | Related Articles | Metrics